Computer Science Technical Report: Approximating a Policy Can be Easier Than Approximating a Value Function
Abstract
Value functions can speed the learning of a solution to Markov Decision Problems by providing a prediction of reinforcement against which the received reinforcement is compared. Once the learned values reflect the correct relative ordering of actions, further learning is not necessary. In fact, further learning can disrupt the optimal policy when the value function is implemented with a function approximator of limited complexity. This is illustrated here by comparing Q-learning (Watkins, 1989) with a policy-only algorithm (Baxter & Bartlett, 1999), both using a simple neural network as the function approximator. A Markov Decision Problem is shown for which Q-learning oscillates between the optimal policy and a sub-optimal one, while the direct-policy algorithm converges on the optimal policy.
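To make the contrast between the two routes concrete, here is a minimal sketch, not the report's experiment: a semi-gradient Q-learning update and a REINFORCE-style direct-policy update, each driving a small linear approximator on a hypothetical two-state, two-action MDP. The MDP, the feature vector, the step sizes, and the episode length are all assumptions made for illustration, and the REINFORCE-style update merely stands in for the direct-policy idea; Baxter & Bartlett's actual algorithm (GPOMDP) differs in detail.

```python
# Illustrative sketch only: contrasts a value-function update (Q-learning) with a
# policy-only update (REINFORCE-style), both on a tiny linear approximator.
# The MDP, features, step sizes, and horizon below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
gamma = 0.9
# P[s, a, s']: action 0 leads to state 0, action 1 leads to state 1 (hypothetical).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
# R[s, a]: action 1 earns reward 1 in both states, so it is optimal everywhere.
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])

def features(s, a):
    """A deliberately limited feature vector, mostly shared across states."""
    x = np.zeros(n_actions)
    x[a] = 1.0 + 0.1 * s
    return x

# ---- Value-function route: semi-gradient Q-learning with a linear approximator ----
w = np.zeros(n_actions)
q = lambda s, a: w @ features(s, a)

s = 0
for _ in range(2000):
    # epsilon-greedy behaviour policy
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax([q(s, b) for b in range(n_actions)]))
    s_next = int(rng.choice(n_states, p=P[s, a]))
    td_error = R[s, a] + gamma * max(q(s_next, b) for b in range(n_actions)) - q(s, a)
    w = w + 0.05 * td_error * features(s, a)   # keeps adjusting values even when greedy policy is optimal
    s = s_next

# ---- Policy-only route: REINFORCE-style score-function update on a softmax policy ----
theta = np.zeros(n_actions)

def policy(s):
    prefs = np.array([theta @ features(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

for _ in range(2000):
    s, grad, ret, disc = 0, np.zeros_like(theta), 0.0, 1.0
    for _ in range(10):                        # short truncated episode
        p = policy(s)
        a = int(rng.choice(n_actions, p=p))
        # gradient of log softmax: chosen features minus their expectation under the policy
        grad += features(s, a) - sum(p[b] * features(s, b) for b in range(n_actions))
        ret += disc * R[s, a]
        disc *= gamma
        s = int(rng.choice(n_states, p=P[s, a]))
    theta = theta + 0.02 * ret * grad          # ascend the expected discounted return

print("greedy actions from Q-learning:", [int(np.argmax([q(s, b) for b in range(n_actions)])) for s in range(n_states)])
print("most probable policy actions  :", [int(np.argmax(policy(s))) for s in range(n_states)])
```

The point of the contrast is only the form of the two updates: the value-function route keeps refitting Q-values even after the greedy policy is already optimal, which is exactly the behaviour the report shows can disrupt the policy under a limited approximator, whereas the policy-only route adjusts the action probabilities directly.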